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Abstract 

A fundamental question of longstanding theoretical interest is to prove the lowest ex- 
act count of real additions and multiplications required to compute a power-of-two discrete 
Fourier transform (DFT). For 35 years the split-radix algorithm held the record by requir- 
ing just 4rt log2 n — 6n + 8 arithmetic operations on real numbers for a size-n DFT, and was 
widely believed to be the best possible. Recent work by Van Buskirk and Lundy demon- 
strated improvements to the split-radix operation count by using multiplier coefficients or 
"twiddle factors" that are not n*'' roots of unity for a size-n DFT. 

This paper presents a Boolean Satisfiability-based proof of the lowest operation count 
for certain classes of DFT algorithms. First, we present a novel way to choose new yet 
valid twiddle factors for the nodes in flowgraphs generated by common power-of-two fast 
Fourier transform algorithms, FFTs. With this new technique, we can generate a large 
family of FFTs realizable by a fixed fiowgraph. This solution space of FFTs is cast as a 
Boolean Satisfiability problem, and a modern Satisfiability Modulo Theory solver is applied 
to search for FFTs requiring the fewest arithmetic operations. Surprisingly, we find that 
there are FFTs requiring fewer operations than the split-radix even when all twiddle factors 
are n*'' roots of unity. 

Keywords: Fast Fourier Transform, FFT, SMT-Solver, Boolean Modeling 
1. Introduction 

In 1965 Cooley and Tukey[14] started a revolution in digital signal processing when they in- 
troduced their fast Fourier transform algorithm (FFT). Their FFT required only 0(n log n) 
addition and multiplication floating-point operations on real numbers, or FLOPs, rather 
than the O(n^) FLOPs required to directly compute a discrete Fourier transform. Al- 
though discovered previously[25][50], it was Cooley and Tukey's timing, which coincided 
with the beginning of widespread use and availability of digital computers, that led to its 
success. The FFT and related algorithms have now found a wide range of application, in- 
cluding electroacoustic music, audio signal processing, medical imaging, image processing, 
pattern recognition, computational chemistry, error correcting codes, spectral methods for 
PDEs and harmonic analysis [5] [50]. 

After the FFT's introduction, there was considerable work on further lowering the FLOP 
count. This was of particular interest since addition and especially multiplication were ex- 
pensive with the computer hardware available at that time. One result that stands out 
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is the work done by Yavne[56] in developing an initial split- radix[18] [54] algorithm with a 
4nlog2 n — 6n + 8 FLOP count for a size-n FFT where n is some power of two, n = 2"*. 
Other important work minimized the number of multiplications but not the total arithmetic 
complexity[28] [29] [17] [55]. In 2004 Lundy and Van Buskirk[39] demonstrated improvements 
to the split-radix operation count by using using constant complex-value multiplier coef- 
ficients or twiddle factors that are not n^^ roots of unity for a size-n DFT. Frigo and 
Johnson[32] generalized Van Buskirk's pioneering work in the context of optimizing the 
conjugate-pair split-radix algorithm[33] . Bernstein[l] then described Johnson's algorithm, 
which is distinct from Van Buskirk's, in terms of algebraic twisting and named it the tan- 
gent FFT. In this paper, we refer to Johnson's algorithm and Bernstein's variation of it, 
differing only in decimation in time versus frequency, as the tangent FFT. Van Buskirk's 
algorithm and the tangent FFT exhibits a modest (~ 5.6%) reduction in FLOP count 
when compared to the split- radix, requiring roughly ^nlog2n. operations rather than the 
previous 4n log2 n — 6n -|- 8. 

This paper presents a proof of the lowest FLOP count for certain classes of DFTs. It is 
beyond the scope of this paper to consider all possible DFTs in our proof. Instead, we focus 
on the common power-of-two complex FFTs and the flowgraphs[10] implied by them. This 
scope still includes a rich set of FFTs as our experiments confirm what others have seen[18]; 
common power-of-two complex FFTs (radix-2, radix-4, decimation-in-time or decimation- 
in-frequency split-radix, conjugate split-radix, classic or any twisted) all exhibit the same 
flowgraph structure (they are graph isomorphisms) but have different twiddle factors as- 
signed to the flowgraph nodes. Furthermore, we restrict our scope to FFTs where twiddle 
factors are n*'* root of unity. This excludes Van Buskirk's algorithm and the tangent FFT, 
but we still can show that other algorithms with lower FLOP count than the traditional 
split-radix exist. 

In 1962, a few years before Cooley and Tukey introduced their FFT, Davis et al. devel- 
oped a machine program for theorem proving[15], now referred to as the Davis-Putnam- 
Logemann-Loveland or DPLL algorithm, which is still at the core of modern Boolean Sat- 
isfiability or SAT solvers. In the past decade, several advances and refinements to DPLL 
have made it practical for larger problems[52][44][21]. New conflict-driven clause-learning 
(CDCL) SAT solvers, which incorporate these recent advances, are now commonly used in 
industry to verify hardware and software correctness. Current SAT research benefits from 
industrial sponsorship and an active community, which organizes conferences and com- 
petitions, creates challenge problems, and defines problem formats[38][40][49]. Recently, 
Satisfiability Module Theories (SMT) generalize SAT beyond binary variables to incorpo- 
rate higher-level theories such as bitvectors, lists and arrays [4 7] [49]. SMT solvers range 
from those that simply reduce a higher-level theory to Boolean logic for a SAT solver, to 
those that extend the core decision procedure to accommodate higher-level theories. 

In this paper, we apply a modern SMT solver to find a lowest FLOP count algorithm for 
the class of FFTs considered. First, we present a novel way to choose new yet valid twiddle 
factors for the nodes in a FFT flowgraph. This technique is more general and leads to a 
richer solution space than the twisting[l] [42], an algebraic way to correctly change the value 
of twiddle factors. This solution space of FFTs is cast as a SAT problem using quantifier- 
free formulas over the theory of fixed-size bitvectors, specified in SMT-LIB format [49], and 
searched with existing SMT solvers [7] [20] [8] [21]. After applying partitioning techniques, we 
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are able to find 6616 FLOP count algorithms for size-256 FFTs, and 15128 FLOP count 
algorithms for size-512 FFTs. These numbers are lower than traditional split-radix, 6664 for 
size-256 and 15368 for size-512, but not as low as achieved by Van Buskirk's algorithm[39] 
or the tangent FFT[32][1], 6552 for size-256 and 15048 for size-512, due to our constraint 
that complex twiddle factors must have modulus one, an absolute value of one. 

Although we supply code for a witness size-256 FFT requiring fewer operations than 
a traditional split-radix[27], we are not addressing algorithm design in this paper. An 
objective to minimize FLOP count is primarily academic given the capabilities of modern 
computing hardware. We use it only as a well-defined and widely-understood objective to 
introduce and demonstrate the power of our formulation and search. We believe the ideas 
presented in this paper can be used to do FFT algorithm design where twiddle factors are 
not all n*^ roots of unity, specific hardware is targeted, or other objectives such as overall 
performance or accuracy are pursued, but these are the topics of a future paper. 

This paper continues with an introduction to the DFT, with emphasis on defining con- 
cepts central and unique to this paper. In Section 2, we present a FFT flowgraph represen- 
tation for generating a family of FFTs. This formulation of the solution space is tailored so 
that it can be easily cast as a SMT problem. Section 3 introduces a first SMT problem for- 
mulation and then develops symmetry reduction and partitioning ideas which allows us to 
solve larger problems. Finally, we conclude with discussion of our experiments and results, 
application to FFT algorithm design, and future work. 

1.1 Definitions 

The DFT (discrete Fourier transform) is a specific kind of Fourier transform whose input 
is a sequence of numbers instead of a function. The sequence of numbers is often obtained 
by sampling a continuous function. Throughout this paper, let n = 2™, let = — 1, and 
let u)n represent the complex n*^ root of unity e~'^~^ . The ra-tuple of complex numbers 
(ao, ai, 0,2, . . . , fln-i) is transformed by the DFT into another n-tuple of complex numbers 
X{k) according to the formula 

n-l 

X{k) = Y: a.oji'. 
j=0 

It is well-known that the complex size-n DFT is a linear operator on C" and can be repre- 
sented as multiplication by an n x n Vandermonde matrix. For our purposes, it is better to 
identify the entries of the n-tuple with the coefficients of the polynomial 

f{x) = ao + aix + a2X^ -!-••• + an-ix^~^ G C[x\. 

Then computing the DFT for a given n-tuple is equivalent to evaluating the polynomial / 
at each of the n*^ roots of unity uj^, for A; = 0, 1, 2, . . . , n — 1. That is, X{k) = /(w^). So 
each output value of the DFT is a weighted sum of the aj, where the weight of aj in X{k) 
is u:i^. 

When an FFT is used to compute a size-n DFT, with twiddle factors of modulus one, 
the product of all the twiddle factors applied to aj in the computation oi X{k) equals the 
weight uj^j^ . We'd like to keep track of the accumulated weight on any given aj through all 
of the intermediate FFT results. To do this, we employ the polynomial view introduced 
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by Fiduccia in [23] and elaborated by Bernstein in [1] and Burrus in [11]. Associating the 
input to a polynomial of degree n — 1 with coefficients aj means that an intermediate FFT 
result is associated to a polynomial of lower degree whose coefficients are weighted sums of 
the ttj. For example, when n = 8 and 

f{x) = ao + aix + a2x'^ + asx^ + 043;^ + a^x^ + a^x^ + ayx^, 

two of the intermediate results of the radix-2 FFT are 

/ mod(x^ - ujg) = (ao + 02 + 04 + 05) + (ai + 03 + 05 + aj)x 

and 

/ mod(x^ - Wg) = (ao - 02 + a^- ae) + (ai - as + 05 - aj)x 

= (ao + a20jj + 04 + oewg) + (ai + asi^jg + 05 + aruig)x. 

Each of the coefficients in the two linear polynomials above is represented by a node in a 
flowgraph, which we describe in the next section. First, we define some characteristics of 
these coefficients. 

Definition 1.1. The base of a coefficient is the aj of lowest index that appears in the 
weighted sum comprising that coefficient. 

Definition 1.2. The stride of a coefficient is the integer difference between the indices of 
any two successive aj in the weighted sum, when the terms of the sum are written with the 
indices in strictly increasing order. When the coefficient consists of a single aj, the stride 
is defined as n, the size of the DFT. 

In the polynomials above, the constant terms have base oq, the linear terms have base 
ai, and all four coefficients have stride 2. If the example above is continued to determine 
/ mod(a;® — 1), each coefficient aj will have base aj and stride 8. For any k, the output 
value X(k) has base oq and stride 1. 

Definition 1.3. The weight stride, Ws, of a coefficient is the integer difference (mod n) 
of the powers put on a;„ to form the weights on any two successive aj in the coefficient, 
when the terms of the coefficient are written with the indices in strictly increasing order. 

The Ws of each coefficient in / mod(x^ — Wg) above is 0. The Ws of each coefficient in 
/mod (j;^ — cjg) above is 4. For / mod(x^ — 1) in the example above, there is no combination 
of the aj comprising any coefficient, so the Ws of each of the eight original coefficients is 

defined to be zero. The Ws for X{k) = aoco^ + aiw^ + 02^^^'^ + • • • + a^Si^'^ is k. 

Definition 1.4. The weight on base, Wi,, of a coefficient in an intermediate FFT result 
is the integer power (mod n) to which Un has been raised to form the accumulated weight 
on the base of the coefficient. 

The Wh of each of the four coefficients in the example, indeed of any coefficient from 
the radix-2 FFT, is zero. To find an example of coefficients with nonzero Wb among the 
common FFTs, we'll consider the size-8 twisted FFT. Given f(x) as in the example, the 
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remainder / mod(x^ + 1) determines the remainder f{ui^x) mod(x^ — 1) as described in [1] 
and [42] . It follows that one of the intermediate results of the size-8 twisted FFT is 

/(cjsx) mod(2;^ - 1) = (oq - 04 + a;|[a2 - 05]) + (^i[ai - as] + ^l[az - a?]) 2; 

= (oo + a20j1 + 04(^8 + aeWg) + (aiWg + 030;! + a^uj^ + aiujl)x. 

So we see that the coefficient of the linear term has base ai and Wb = 1. This Wb as well 
as the other definitions from this section are visually summarized in Figure 1. 

Ws=2 



1 

(aiCUg + aa^l + a^ul + a^uJl)x 



base^a, ^ ^^^^^ 



Figure 1. Definitions 



2. A Flowgraph Representation for 
Generating a Family of FFT Algorithms 

Signal flowgraphs are a widely used formalism to represent and analyze FFTs[10][5]. In this 
section we show how the concepts defined in Section 1.1 occur in flowgraphs of common 
power-of-two FFTs. In particular, we will show that each node in a given flowgraph can 
be labeled with a 3-tuple, {stride, base, Ws), which is an invariant for a family of FFT 
algorithms that can be realized by that given flowgraph. This invariant can then be used 
to generate FFT instances realizable by that flowgraph. 

To facilitate the discussion, we show two example flowgraphs. The first, shown in 
Figure 3, is Gauss' original FFT[25][1]. The second, shown in Figure 5, represents a size-16 
conjugate split- radix as discussed in [32] [33]. 

2.1 Edges and Nodes 

Each directed edge represents the transfer of a complex number, either into or out of a 
node. In an algorithmic implementation of the flowgraph, each edge is indeed a single 
concrete complex number, but for the purposes of our flowgraph analysis, this complex 
number should be thought of symbolically as a weighted sum of the Uj, where the weight 
on any aj is some w*. 

The input operands of the FFT, labeled ao...a„_i, are shown at the top and the output 
values, labeled X{0)...X(n — 1), are shown at the bottom. Unlike traditional FFT flow- 
graphs, we use Uj instead of x{j) for input operands and show data flow top-to-bottom 
instead of left-to-right to facilitate discussion relating this flowgraph to the polynomial 
evaluation perspective of the FFT. 

Each node represents complex addition and/or multiplication operations applied to the 
input operands to generate the output values. Figure 2 shows the internal behavior of a 
node. For nodes with two input edges, the two input operands are added to produce the 
single complex result id, when viewed concretely. We prefer to view id symbolically, as a 
weighted sum of the aj, where the weight on any aj is some a;*. Next, id is separately 
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Figure 2. Node Internal Behavior 



multiplied by two twiddle factors to produce the left and right output values. In Figure 
2 the left and right twiddle factors are shown as u;^-^^ and oJn^^ respectively but in the 
concise node representation as seen in Figure 3 only the integers Itfp (left twiddle factor 
power) and rtfp (right twiddle factor power) are shown in the bottom row of each node. 

We count the cost of multiplication by some twiddle factor uj^^ in the traditional way, 
where the cost of multiplication by 1,— 1,« or —i is free, multiplication by — \f—l 
or —\f^i is 4 floating point operations (FLOPs), and multiplication by any other n*'^ root 
of unity is 6 FLOPs. In addition to the potential multiplication cost, each node always 
requires 2 FLOPs for the cost of the addition. 

There is one interesting multiplication cost exception when iJ'n^'^ and w^*-^^ are complex 
conjugates. In this case, IHe {iJ'n^'^) = ^tn (u)'^-^^) and Dim {uj^-^p) = Vie {oj^^^) so that only 4 
real multiplications and 4 real additions are needed to weig ht by both J^S'p and uj^^fP. In 
this case, we tally 6 FLOPs for the weight by oo^^^'p and only an additional 2 FLOPs for the 
weig ht by u'^^fP. For cases where Itfp = rtfp, we tally FLOPs for uj^^"^ only. 

Two separate multiplications by w^-^^ and uj^^'^ never seen in traditional FFTs as 
it typically leads to higher cost when counting total floating point operations. Instead, one 
multiplication is done and the result may or may not be negated at no additional cost to 
generate the second output. In this paper, we adopt the more general description containing 
two separate multiplications and will later show how constraints can be applied to prune 
the search space to solutions requiring only a single multiplication without detriment to the 
final global FLOP count. 

Dotted nodes in the top row of Figure 3 only have a single input operand and con- 
sequently there is no internal addition. In this case, the input operand is used directly 
as operand id within the node. Nodes in the bottom row of Figure 3 only have a single 
output value and consequently there is just a single internal multiplication. In this case, 
only a single twiddle factor is specified. Again, traditional FFTs often suppress this final 
multiplication as it is typically a cost- free multiplication by 1 or -1. For the generality of 
our first formulation, we always include this final multiplication, but will later prove via 
SMT that it is not required when searching for lowest FLOP count FFTs. 
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Figure 3. Size-8 Radix-2 DIF FFT Flowgraph 



For the class of FFT flowgraphs we are considering, each node has at most two parents 
and two children. We adopt dot notation when it is necessary to refer to attributes of 
a node's parents or children. We refer to a node's left or right parent as nd.lp or nd.rp, 
respectively, for a node nd. Likewise, nd.lc or nd.rc refer to a node's left or right child, 
respectively. With this notation, a node's left parent's left parent twiddle factor can be 
referred to concisely as nd.lp.lp.u^P. Note that tfp can be used here rather than Itfp or rtfp 
as it is clear from the graph context when tracing edges which twiddle factor applies. 

2.2 Flowgraph Properties 

Our flowgraph analysis requires that the following two properties be true, which are checked 
by computer traversal of the flowgraph. 

Property 2.1. There is at most one path from any input operand aj or internal node ndp 
to any output value X{k) or node ndq. 

Definition 2.1. The subset of input operands aj that can reach a node contains that node's 
original ancestors and is denoted as nd.A for a node nd. The subset of output values 
X{k) that is reachable from a node contains that node's terminal descendants and is 
denoted as nd.D for a node nd. 

Property 2.2. For any node nd in the flowgraph, when the elements of nd.A are ordered 

such that indices are strictly increasing, the difference (mod n) of indices on successive 
original ancestors in the list is constant. Furthermore, original ancestors of a node's left 
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parent interleave precisely with the original ancestors of the right parent, and the sets are 
always disjoint. 



Example 2.1. For the node at the end of the third row in Figure 3 labeled with a bold 



when ordered with strictly increasing indices. The integer difference (mod n) of successive 
original ancestor indices is always 2 for this example. For the node's left and right parents, 



which interleave precisely when combined. 

Flowgraphs adhering to these two properties are expected given the divide-and-conquer 
nature of common power-of-two FFTs. We have built flowgraphs of various size-n for radix- 
2, radix-4, decimation-in-time and decimation-in-frequency split-radix, conjugate split-radix 
[32] as well as twisted[l] FFTs, and have always found these properties to be true. For FFTs 
with some radix-4 content, this requires that when adding four numbers, the addition is 
factored into a binary addition tree that observes Property 2.2, which is what is commonly 
done. For twisted FFTs, different twisting functions C Isad to different permutations of 
X(k), but these are isomorphisms of the same flowgraph structure. It is not the point of this 
paper to prove which FFT algorithms generate which flowgraphs. Instead, we observe that 
many common power-of-two FFT algorithms generate flowgraphs that have these properties, 
and we require adherence to develop our flowgraph-based ideas. 

2.3 A Node's base and stride 

In the flowgraph shown in Figure 3, the left number in the middle row of each node is the 
base index for that node's id and the number at the left of an entire row of nodes is the 
stride for any node's id, when id is viewed symbolically as a weighted sum. Figure 4 is a 
key for all flowgraph labels and Figure 2 identifies the internal edge id. 



1.6 



nd.A = {ai, as, 05, 07} 



nd.A = nd.lp.A U nd.rp.A 

= {01,05} U {03,07} 
= {01,03,05,07}, 



rWb 



stride 



base 



Itfp 



rtfp 



FLOPS 



Figure 4. Flow Graph Node Key 
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Definition 2.2. A node's base label, nd.base, is always the index of the base, as defined 
in Definition 1.1, for the weighted sum nd.id represented by the node. The number nd.base 
is the minimum of nd.lp.base or nd.rp.base. For the terminal case when nd has a single 
input operand, nd.base is equal to j for the given input aj. 

To facilitate computation of nd.base, later computation of weight on base, as well as 
impose regularity on the flowgraph, the following property is enforced in flowgraph diagrams 
and computer data structures. 

Property 2.3. For any node nd in a flowgraph, the relation {nd.lp.base < nd.rp.base) is 
always true. 

Definition 2.3. A node's stride label, nd. stride for a node nd, is always the stride, as 
defined in Definition 1.2, for the weighted sum nd.id represented by the node. The number 
nd. stride is the absolute difference (mod n) of nd.lp.base and nd.rp.base. For the terminal 
case when nd has a single input operand, nd. stride is defined to be n for a size-n FFT. 
Property 2.2 ensures that strides are constant and hence a single stride label per node is 
sufficient. 

Example 2.2. From Example 2.1, we know that for the last node in the third row of Figure 
3, 

nd.A = {ai, a^, a^, aj}. 

Ignoring values of applied weights in the flowgraph, the polynomial coefficient represented 
by this node must be of the form 

nd.id = oicjg + aaWg + osWg + ajcol. 

From the discussion in Section 1.1 we can deduce that nd.base = 1 and nd. stride = 2. Also, 
we see that nd.lp.base = 1 and nd.rp.base = 3. By Definition 2.2, 

nd.base = mm{nd.lp.base, nd.rp.base} 
= min{l, 3} 
= 1, 

and by Definition 2.3, 

nd. stride = nd.rp.base — nd.lp.base (mod n) 
= 3-1 (mod 8) 
= 2. 

This node's row in the flowgraph is labeled with 2, the stride. The ffist label in the middle 
row of the node itself is 1, the base. 

2.4 Weight on base 

The weight on base for every node's input edge, as well as that node's weighted sum id, is 
recorded in the flowgraph. Even though Wf, is defined for a true polynomial coefficient in 



9 




Figure 5. Size-16 Conjugate Split-Radix FFT 



Definition 1.4, we record the weight on base for both the left and right input edge before 
the addition since both are required later to determine a node's weight stride. As shown 
in Figure 4, the top row of each node specifies Wh, the integer power (mod n) to which 
cvn has been raised to form the accumulated weight on the base of the weighted sum of aj 
represented by the left input edge. Likewise, rWh represents the same for the right input 
edge. From Property 2.3, we know that after the addition the base of nd.id is the same as 
the base of the left parent and that the addition does not alter weights. Thus, for nd.id 
is equal to the weight on base of the left input edge and there is no need for a separate IWf,. 

Definition 2.4. The weight on base of a node's left input edge is, 

nd.uj^^ = [ndIparent.uj\l^){nd.lparent.oj^^)., 

and likewise for a node nd's right input edge is, 

nd-uj"^^ = (nd.rparent.uj\l^){nd.rparent.uj^''). 

Following from Definition 2.2 and Property 2.3, the weight on base of nd.id is equal to the 
weight on base of the left input edge and both are referenced as PVf,. For the terminal case 
when nd has a single input, Wh is defined to be zero. Finally, note that Wh for all output 
values X{k) is always zero as all X{k) contain a constant term with ao that can only be 
weighted by uj^ in any correct DFT. 
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Example 2.3. Since weight on base is the result of a series of multiplications by various 
roots of unity w* , it can also be viewed as addition (mod n) of the powers on the roots of 
unity. This is illustrated by the path shown with bold edges from input operand 03 to a 
node nd with nd.base = 1 and nd. stride = 2 in Figure 5. Then 

nd.u:[f^=u;ll{ull{uUa3))), 

which is 03 multiplied by all twiddle factors along the path, and can be rewritten as 

rWb = 13 + 12 + (mod 16) 
= 9. 

This rWb, 9, is shown in the upper right corner of the node. Once this path reaches nd, 
we no longer keep track of the weight on 03 as it is no longer the base of the weighted sum 
id. However, it is still essential to keep track of this weight up to this point as it is used to 
compute Wg. 

2.5 Weight Stride 

A node's weight stride label, Ws, is shown at the right of each node's middle row, as seen 
in Figures 3, 4 and 5. 

Definition 2.5. A node's weight stride label is 

nd.Ws = nd.rWf) — nd.Wb (mod n). 

For the terminal case when nd has a single input, Wg is defined to be zero. The number 
nd.Wg is always the Wg as defined in Definition 1.3 for the weighted sum nd.id represented 
by the node. 

Example 2.4. Again consider the node nd at the end of the bold path in Figure 5 where 

nd.Wg = nd.rWb — nd.Wb (mod n) 
= 9-3 (mod 16) 
= 6. 

This Wg, 6, appears as the last label in the middle row of this node. We can now reconstruct 
exactly the weighted sum of coefficient nd.id. For the node we are considering with stride = 
2, base = 1, Wg = 6 and Wb = 3, 

Q Q 11 1 V 1Q 

nd.id = ai^ig + asu^Q + a^uj^Q + ay^ig + ^9^16 + ani^ie + ais^ig + ais'^ie 

Now that a node's Wg is defined, we present a key observation that Wg is invariant 
across all FFTs that can be mapped to the given flowgraph. This invariance is central in 
defining a family of FFTs that can then be searched for desirable members. 

Theorem 2.1. For a size-n FFT flowgraph constructed by any FFT algorithm such that 
Properties 2.1 and 2.2 hold, every node's Wg is an invariant. 
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Proof. Consider an arbitrary node nd in the flowgraph. Next, consider an arbitrary FFT 
output value from this node's terminal descendants, X{k) £ nd.D. For the two input values 
a(nd.base) and a(^nd.base)+{nd.stride) in the Weighted sum X{k), it follows from the definition of 
the DFT that these input values must be weighted as 

„ , kind.base) (mod n) „„ j „ , ,k((nd.base)+(nd. stride)) (mod n) 

0'{nd.base)^n ^^O- a(^nd.base)+{nd.stride)^n 

Hence, the weight stride between these input values is 

weight stride = k{{nd.base) + {nd. stride)) — k{nd.base) (mod n) 
= k{nd. stride) (mod n). 

Since, by Property 2.1, nd is the only contributor of a(^nd.base) and a(^(nd.base)+{nd.stride)) to 
X(k), and they are are bound together by the addition in nd and never will be weighted 
again individually, we have 

nd.Ws = k{nd. stride) (mod n). 
Any other value for nd.Ws would produce an incorrect FFT result. □ 

Example 2.5. Again consider the node at the end of the bold path in Figure 5 with 
nd.Ws = 6, nd.stride = 2 and X(3),X(11) E nd.D. Then, 

nd.Ws = k{nd. stride) (mod n) 
6 = 3x2 (mod 16) 
= 11x2 (mod 16). 

2.6 Canonical Node Labels 

Since the three node labels base, stride and Ws are either defined or proven to be unchanged 
by any applied weight Un, we can now assign a canonical label to each node. For a size-n 
FFT flowgraph, the set of canonical node labels defines a family of FFTs that can be realized 
by that flowgraph. Actual applied weights Un, interpreted as VFf,, distinguish members in 
the family of FFTs. 

Definition 2.6. A node's canonical label is nd{nd.stride,nd.base, nd.Ws). 

Example 2.6. Again consider the node at the end of the bold path in Figure 5. This node is 
labeled nd{2, 1, 6) and is the only node with that label in the flowgraph. The nd.stride = 2 
appears to the left of the row in which nd(2, 1, 6) is found. The nd.base = 1 and nd.Ws = 6 
appear in bold on the node itself. 

2.7 Correspondence to the Polynomial View 

Although our main representation is a flowgraph, we have relied on the polynomial view of 
the FFT to facilitate our discussion. In particular, each edge of the flowgraph represents a 
weighted sum of aj and each nd.id a coefficient of the original polynomial modulo one of 
its factors, also a weighted sum of aj. We highlight with gray background in Figures 3 and 
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x + l x-l x + i x-i x + iVi x - iVi x + y/i x - Vi 
Figure 6. Factor lattice of — 1 



5 the polynomial factor lattice as described by Bernstein in [1] and shown in Figure 6. The 
degree of each polynomial factor is the stride for all flowgraph nodes it contains. The power 
to which iOn is raised to form the constant term in that polynomial is the Wstride for all 
flowgraph nodes it contains. And finally, each node's nd.base is the index of lowest degree 
among the aj used in the weighted sum represented by that node. 

Recall our example for the case n = 8. We associate the sampled data to coefficients of 
a degree 7 polynomial: 

/(x) = ao + aix + a2X^ + a^x^ + 042;^ + a^x^ + a^x^ + ajx'^ . 

This polynomial is an element of C[x], where we have a division algorithm that gives 
f{x) = r{x) modp(x) when f{x) = q{x)p{x) + r(x), for r{x) of lesser degree than p{x). In 
particular, we have f{x) = f{x) mod(x^ — 1) because / has degree strictly less than 8. The 
residue class of / modulo (x* — 1) determines the residue class of / modulo (x^ — 1) and 
modulo (x^ + 1)) as the latter two polynomials are factors of x* — 1. Given a complete 
factorization of x^ — 1 into distinct, irreducible, linear factors as shown in Figure 6, 

we have the following isomorphism of rings: 
C[x]/(x^-l) = C[x]/(x-l) xC[x]/(x + l) xC[x]/(x+f) x • • • xC[x]/(x + \/i) xC[x]/(x-\/i), 

which follows from the Chinese Remainder Theorem, as seen in [19]. 

So an FFT algorithm is finding an element from the product ring that corresponds to 
the given /(x) G C/(x^ — 1). Our flowgraph highlights the path of the inputs through 
the lattice of factor rings and canonical homomorphisms in Figure 7. The intermediate 
polynomials are 
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f{x) mod(x'* — 1) 
f{x) mod(x^ + 1) 
f{x) mod(x^ — 1) 
f{x) mod(x^ + 1) 
/(x) mod(x^ — i) 
/(x) mod(x^ + i) 



(ao + 04) + (ai + a5)x + (02 + a6)x^ + (03 + a7)x^ 
(ao - 04) + (ai - a5)x + (a2 - a6)x^ + (03 - a7)x^ 
(ao + 04 + 02 + ae) + (^i + ^5 + «3 + ^7)2; 



(ao + 04 — a2 — ae) + (ai + as — 03 — a7)x 



(ao - 04) + (a2 - a6)^ + [(ai - 05) + (03 - a7)i]x 
(ao - ai) - (a2 - a6)^ + [(ai - 05) - (03 - a7)z]x. 



Since finding the residue of /(x) mod (x^ + 1) is equivalent to setting x^ equal to —1 = Wg 
in /(x), we see in the coefficients of /(x)mod(x'^ + 1) pairs of the original inputs whose 
indices differ by 4. Viewing these coefficients as weighted sums of the aj, where the aj are 
written with the indices increasing, we note that successive weights change by a factor of 
cjg = —1. Since finding the residue of /(x) mod (x^ + i) is equivalent to setting x^ equal 
to —i = Wg in /(x), we see in the coefficients of /(x)mod(x^ + i) four of the original 
inputs whose indices differ by 2 when listed in increasing order. Viewing these coefficients 
as weighted sums of the aj, where the aj are written with the indices increasing, we note 
that successive weights change by a factor of a;| = —i. 

Example 2.7. In Figure 3, the stride = 4 row has two polynomials highlighted, the left 
labeled Xg — i^g and the right labeled x| — w|. The right polynomial, x| — a;|, has four nodes 
corresponding to the four terms of this new polynomial. The constant term has base = 0, 
the linear term has base = 1, and so on, until the last node with base = 3 represents the 
coefficient of the x^ term in the polynomial. Since these four nodes arise from Xg — Wg, all 
nodes in this polynomial have stride = 4. Finally, since the constant term of the factor 
polynomial is —ojg, written as ujn raised to a power instead of the usual +1, all nodes in 
this polynomial have a Wstride = 4. 

This correspondence exists for the original FFT attributed to Gauss [25]. Twisting as 
described in [1] implies that we use a different factor lattice for x*^ — 1. But it is essential 



/(x) G C/(xS - 1) 



C/(x4- 1) X C/(x4 + l) 



C/(x2 - 1) X C/(x2 + 1) X C/(x2 + i) X C/(x2 - i) 



C/(x- 1) X C/(x + l) X 



X C/(x + \/i) X C/(x 



Figure 7. Factor rings with canonical homomorphisms 
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to remember that our flowgraph analysis to derive canonical labels is independent of any 
twists and will derive the same canonical labels for a size-n flowgraph regardless of what 
twists are applied in the particular FFT used to generate the flowgraph. 
In the case of twisting + 1 to — 1 in a size-8 FFT, we see that 

r(x) mod(x^ + 1) 
q{x){x^ + 1) + r(x) 
q{uJsx){ (ujsx)^ + 1) + r{ujsx) 
q{ujsx) ( ujjx"^ + 1) + r{uj8x) 
{-l)q{uj8x){x^ - 1) + r{uj8x) 
r{x) mod(x^ — 1), 

where x = uj^x. Taking the polynomial view, the element i;"^ — 1 G C[x\ has a factor 
tree isomorphic to that of — 1 G C[x]. Whereas the factor ring C[x]/(x^ + 1) may be 
considered a 4-dimensional vector space over C with basis {1, }, the new factor 

ring C[x]/(x^ — 1) has basis {1, u^x, {uj^x)"^^ (wgx)^}. So the stride and Wstride exhibited 
by each set of highlighted nodes in the flowgraph is preserved. 

2.8 Generating a Family Member FFT Algorithm 

Because Wg is independent of any particular FFT's twiddle factors, we can use it as the 
basis for generating all members of a family of FFT algorithms represented by a given 
flowgraph. A valid FFT can be created by randomly picking integer VFb values for all 
nodes in the flowgraph. Given these choices for Wh, Wg determines values for rWb for 
all nodes in the flowgraph. Next, VFf, and rW^ determine values for all twiddle factors, 
and a unique assignment of twiddle factors distinguishes a member in the family of FFT 
algorithms. Before we present a more formal algorithm for this process, we must first define 
how twiddle factors can be determined from Wh- 

Definition 2.7. Following from Definition 2.4, a node's twiddle factors, uj^^'^ and ujn^^, 
can be determined from Wb'. 

{nd.lp.uj\fP){nd.lp.uj^^) = nd.oo^" 

ndlp.JJP = {nd.uj'^^)/{nd.lp.uj^^). 

This can be expressed as (mod n) subtraction of powers: 

nd.lp.tfp = nd.Wb — nd.lp.Wb (mod n). 

The twiddle factor for a right parent is similarly defined as: 

nd.rp.tfp = nd.rWb — nd.rp.Wb (mod n). 



^ fix) = 
/(wsx) = 
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Example 2.8. Consider the node nd{2,l,6) in Figure 5. For this node, nd.Wb = 3 and 
nd.rWb = 9 are specified. Also, nd.lp.Wb = and nd.rp.Wh = 12 are specified. Hence, 



nd.lp.tfp = nd.Wb — nd.lp.Wb (mod n) 
= 3-0 (mod 16) 
= 3, 

which is the twiddle factor applied to that edge by nd.lp as shown in the Figure. Likewise, 

nd.rp.tfp = nd.rWb — nd.rp.Wb (mod n) 
= 9-12 (mod 16) 
= 13, 

which is the twiddle factor applied to that edge by nd.rp. 



Algorithm 1: How to Generate a Random Member FFT Algorithm 



1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 



Input: Size-n flowgraph with labeled invariants 
Output: Size-n flowgraph with twiddle factors assigned 
foreach nd £ flowgraph do 
if nd. stride ^ n then 

nd.Wb ^ randO (mod n) 
nd.rWb ^ nd.Wb + nd.Wg ( mod n) 
else 
I nd.Wb ^ 
foreach nd £ flowgraph do 
if nd. stride ^ n then 

nd.lp.tfp ^ nd.Wb — nd.lp.Wb (mod n) 
nd.rp.tfp ^ nd.rWb — nd.rp.Wb (mod n) 
if nd. stride = 1 then 
I nd.tfp — nd.Wb (mod n) 



Example 2.9. Figure 8 shows a random member from the family of FFTs realizable by 
a size-8 flowgraph. Consider node red(l,0,3). Since nd. stride ^ n, we assign a random 
integer (mod 8) of 3 to nd.Wb. Following Algorithm 1, 

nd.rWb = nd.Wb + nd.Wg (mod n) 
= 3 + 3 (mod 8) 
= 6. 



This same process is repeated until Wb and rWb have been assigned for all nodes with 
stride ^ n in the flowgraph. Nodes with a single input, shown as dotted, always have 
Wb = following Definition 2.4. Next, actual twiddle factors are computed from the weight 
assignments. Again consider node nd(l,0,3) and the computation of twiddle factors for 
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Figure 8. Size-8 Random FFT Flowgraph 



that node's parents: 

nd.lp.ltfp = nd.Wh — nd.lp.Wb (mod n) 
= 3-1 (mod 8) 
= 2 

nd.rp.lt fp = nd.rWb — nd.rp.Wf, (mod 8) 
= 6-2 (mod 8) 
= 4. 

Since nd has no children nodes (nd. stride = 1) we must compute its twiddle factor as 

nd.tfp = — nd.lp.Wb (mod n) 
= 0-3 (mod 8) 
= 5. 

This process is repeated until all twiddle factors are assigned. 



3. Searching a Family of FFT Algorithms 

In Section 2, a flowgraph representation of common power-of-two FFTs is developed that 
defines an invariant weight stride, Ws, for each node in the flowgraph. Algorithm 1 uses 
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the weight stride invariants to generate a new assignment of twiddle factors and hence a 
unique FFT. Since Algorithm 1 is arbitrary, a rich solution space of valid FFTs, called a 
family, results. In this section, we characterize the size of this family, specify a family as a 
Satisfiability Modulo Theory (SMT) problem, and demonstrate SMT solver-based search of 
this solution space. Although this search can be directed in various ways, we use it to prove 
the lowest total arithmetic complexity (fewest required FLOPs) when all twiddle factors 
are n^^ roots of unity. 

3.1 The Size of a Family 

The solution space of valid FFTs for a given flowgraph is extremely large! 

Definition 3.1. A size-n flowgraph's solution cardinality is 2"^°S2 "^°S2 and is the 

number of valid FFTs realizable by the given flowgraph. By direct examination of Algorithm 
1, each node nd in a size-n flowgraph, where nd. stride ^ n, is arbitrarily assigned some 
integer (mod n) to nd.Wi,. Thus, there are n = 2'°S2'^ possible choices for a single node's 
Wfe. And, since there are nlog2n nodes in the flowgraph where nd. stride ^ n, there are 
(2iog2n)niog2n ^ 2" i°g2 loga n possible assignments of ah Wb in a size-n flowgraph. 

Example 3.1. For a size-256 flowgraph, there are 

solution cardinality = 2^56 log^ 256 log, 256 ^ 2^6384 

valid FFTs possible. One can better appreciate the magnitude of this number when re- 
minded that the estimated number of atoms in the universe is 2^^^ and the current fastest 
supercomputer performs 2^^^ FLOPs per second. Yet this number is very small when com- 
pared to all valid and invalid n*^ root of unity assignments possible for twiddle factors, 
234816 _ Thus, for this size-256 flowgraph, there is just a 1 in 2^^^^^ chance of guessing 
correct twiddle factors. 

Although the solution space is immense, in practice we are only interested in family mem- 
bers with desirable qualities, such as fewest required FLOPs, better precision^-, improved 
performance or ease of implementation on a specific microarchitecture. Consequently, we 
need a way to search this space and find these more desirable family members. 

3.2 A First SMT Formulation 

Because of the way concepts were developed in Section 2, it is straight-forward to model 
Algorithm 1 as an SMT problem. This is best illustrated by considering Listing 1, which 
shows portions of the SMT model in SMT-LIB 1.2 format[49] that is created to find a lowest 
arithmetic complexity instance of a size-16 FFT. After a standard preamble, lines 4 and 5 
declare the external inputs nd{2, l,6).Wh and nd{2, 1, 14). W;,, which are both 4-bit vectors. 
Although not shown in the listing, inputs for all undetermined Wb are included. It is for 
these variables that the SMT solver attempts to find a satisfying assignment. For nodes 
where nd. stride = n, the value nd.Wb is predetermined to be and is declared as a constant. 

1. All family members are exact and do not sacrifice numerical accuracy. Imprecision arises from choice of 
twiddle factors with values very close to zero and consequent fioating-point representation limitations. 
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An example of this is shown in line 9 and corresponds to Algorithm 1 line 6. Next, rWb 
for all nodes is computed via addition of Wb and nd. stride. An instance of this is seen in 
line 11 and corresponds to Algorithm 1 line 4. Note that all addition and subtraction is 
naturally (mod n) given the fixed-size bitvectors in the SMT formulation. Twiddle factors 
for all nodes are computed as illustrated in lines 13 and 14. This corresponds to lines 9-12 
of Algorithm 1 . 

Unlike Algorithm 1, the objective of the SMT model is to find the lowest arithmetic cost. 
For this, we must compute the cost implied by every twiddle factor. Lines 16-18 show the 
computation of cost predicates cO, c4, c6, (0, 4 or 6 FLOPs for multiplication, respectively), 
for the left twiddle factor of nd(4, 1, 12). Not shown in this listing are any necessary predi- 
cates cO, c2, c4, c6 for multiplication cost incurred by the right twiddle factor. Line 19 shows 
cost predicates used in an if-then-else (ITE) tree to compute the multiplication FLOPs 
required by a node. We compute predicates first and then a numeric cost as the predicates 
are useful in defining pruning constraints later. In line 21, a total cost is computed by 
simply adding up all node multiplication costs. Finally, line 22 constrains the total cost to 
be less than or equal to some constant, and line 23 specifies that this multiplication FLOP 
constraint is satisfied. Note that the FLOP count due to a node's addition is constant for 
the flowgraphs under consideration and is not explicitly included in the SMT models. 

1 (benchmark examplel 

2 : logic QF_BV 

3 ... 

4 :extrafuns ( (Wb_2_l_6 Bit Vec [ 4 ] ) ) 

5 : extrafuns ( ( Wb_2_l_14 Bit Vec [ 4 ] ) ) 

6 ... 

7 : formula 

8 ... 

9 (let (?Wb_16_14_0 bvO [4] ) 

10 ... 

11 (let (?rWb_2_l_6 (bvadd Wb_2_l_6 bv6[4])) 

12 ... 

13 (let (?ltfp_4_l_12 (bvsub Wb_2_l_6 ?Wb_4_l_12 ) ) 

14 (let (?ltfp_4_3_12 (bvsub ?rWb_2_l_6 ?Wb_4_3_12)) 

15 ... 

16 (flet ($c0_4_l_12 (= (extract [1:0] ?ltfp_4_l_12 ) bv0[2])) 

17 (flct ($c4_4_l_12 (and (= ( cxt r ac t [ : ] ?ltfp_4_l_12 ) bvO [ 1 ] ) (not $c0_4_l_12 ) ) ) 

18 (flct ($c6_4_l_12 (not (= ( cxt r ac t [ : ] ?ltfp_4_l_12 ) bvO [ 1 ] ) ) ) 

19 (let (?cost_4_l_12 (ite $c6_4_l_12 bv6 [4] (ite $c4_4_l_12 bv4 [4] bv0[4]))) 

20 ... 

21 (let (?totalcost (bvadd ?cost_2_2_l (bvadd ?cost_4_l_12 ? cost_4_3_12 ) ) ... 

22 (flct (Smaxcost (bvulc ?totalcost bv22[4])) 

23 Smaxcost 

24 ) . . . ) 

Listing 1. Sample SMT Code 



In practice, more care is given to the total cost addition seen in line 21. A balanced adder 
tree is constructed, where each add uses only as many bits as required for the worst case. 
Furthermore, following the recursive structure in the FFT apparent from the polynomial 
view, cost for smaller FFTs are computed first and then combined to compute the cost for 
larger parent FFTs. This total cost computation is effectively a pseudo-Boolean constraint, 
and we have tried implementing it as an if-then-else (ITE) tree similar to the ROBDD- 
techniques described in [21]. Our experience is that the adder tree is 2-3 times better 
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in terms of SMT computation time for this particular problem with the Boolector SMT 
solver [7]. We did not implement the sorter-based technique described in [21]. 

To find the lowest arithmetic complexity, the SMT model is repeatedly solved, each 
time with a lower value for the constant seen in line 22. At some point the model becomes 
unsatisfiable and the lowest possible arithmetic complexity is known. Unfortunately, this 
straight-forward implementation does not scale up. For flowgraphs of size-32, the time 
for computing the unsatisfiability of a 455 FLOP solution requires 30 seconds using the 
Boolector solver [7] on a 64-bit Intel Core 17 Linux machine. At size-64 and for unsatisfiable 
cost of 1159 FLOPs, we reach our timeout of 24 hours without determining unsatisfiability. 



3.3 Cost Symmetries 

As formulated so far, the SMT model supports the full range of possible values for each 
twiddle factor since each twiddle factor is modeled as a size-m bitvector. This much infor- 
mation is not necessary for finding the lowest possible arithmetic complexity, and only adds 
to the complexity of the model. Instead, it is possible to express every twiddle factor as 

tfp _ tfp-{tfp (mod "/4)) t/p (mod n/4) 

In this expression, ujI[^ ^^^^ ^^"'^ specifies the quadrant in which co^^ lies and is always a 
free multiplication by or —i. Consequently, the portion of u^^^ that solely contributes 

to multiplication cost can be represented by just a quarter of the n^^ roots of unity and is 
defined as ipj/^ = ojn^ {^'^'^ "A) ^ rj,^ simplify the SMT model and upcoming partitioning, we 
suppress the quadrant rotations, u/J^ ^^^^ {awd /4))^ only reason with if^n- 

Multiplication of two tpn is well defined and can be expressed as modular arithmetic. 
Consider the multiplication 

, a+h (mod n) _ , a, b 

which can be re-expressed as 

a+b—{a+b (mod "/4)) (mod n) l a+b (mod "/4) (a (mod "A)),/,'*/ (mod "/4)) / b 

If all ojn specifying quadrant rotations are ignored, multiplication of two tpn is just 

^a+Kmod./4)^^a^fe^ 

which can be expressed easily using modular arithmetic in the SMT model. 

From the bitvector perspective, suppressing ojn quadrant rotations means that the two 
most significant bits of every weight on base, Wb or rWf,, need not be included in the SMT 
model. The SMT solver finds a satisfying assignment for all but the two most significant 
bits of every weight on base. The two most significant bits are then picked at random as 
done in Algorithm 1 without altering cost. In the end, all bits must be assigned to realize 
a correct FFT. 

Eliminating these cost symmetries in the SMT model reduces a size-n flowgraph's solu- 
tion cardinality to 2"'^°^2 "■((iog2 ")-2)_ Yoi a size-256 flowgraph, this is a substantial reduction 
in the size of the solution space from 2^^^^^ to 2^-^^^^. Computation time for proving that 
a size-32 flowgraph has no solution with total cost equal to or less than 455 FLOPs is now 
27 seconds. The timeout of 24 hours is still reached for a size-64 flowgraph constrained to 
1159 FLOPs. It is possible that the SMT computation time improves only modestly since 
the SMT solver is detecting these cost symmetries without explicit help. 
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3.4 Butterflies 



The next three techniques to simplify and partition the SMT model require reasoning with 
FFT butterflies. Although FFT butterflies are a well established idea, we define and review 
concepts relevant to our SMT model. 

Deflnition 3.2. A size-g butterfly is any subgraph of a size-n FFT flowgraph that is 
graph isomorphic to a size-g FFT flowgraph where q <n. A. butterfly's canonical label is 
bf{nd. stride, nd.base, nd.Wg, q) for the single node nd G bf such that nd. stride, nd.base and 
nd.Ws are less than or equal to the stride, base and Wg of any other node in the butterfly. 
As with all FFT flowgraphs considered in this paper, q must be some power of two. 

Example 3.2. In Figure 5, the butterfly 6/(1,0,0,2) contains the four nodes nd{l,0,0), 
nd{l,0, 8), nd(2, 0, 0) and nd(2, 1,0). The expected traditional butterfly structure is clearly 
seen with the node used for identification, nd(l,0,0), at the bottom left. This same node, 
nd{l,0,0), is also used to identify the size-4 butterfly 6/(1,0,0,4) which contains 12 nodes 
and is also clearly visible. Less obvious are small butterflies that appear toward the top 
of the flowgraph such as 6/(4,2,0,2) which contains the four nodes n(i(4,2,0), n(i(4,2,8), 
nd(8, 2, 0) and nd{8, 6, 0). The larger butterfly 6/(4, 2, 0, 4) which contains 12 nodes can also 
be traced with nd(4, 2, 0) anchoring the bottom left corner. Finally, the entire flowgraph in 
Figure 5 can be denoted as 6/(1, 0, 0, 16). 

It is also useful to refer to nodes in an arbitrary butterfly not by canonical node label 
but by relative position. To facilitate this, one can view the nodes of a butterfly as forming 
a matrix and use matrix row, column indexing to refer to a specific node, ndr,c- For example, 
for any size-2 butterfly, the top-left corner node is ndo^Q, the top-right corner node is ndo,i, 
the bottom-left corner node is ndi^o, and the bottom-right corner node is ndi^i. 

Property 3.1. For size-2 and size-4 butterflies in the flowgraph, all Wg for nodes in the 
same row are congruent modulo "/4. The value to which they are all congruent modulo "/4 is 
referred to as ndr^^.Ws- This property arises from the correspondence of Ws in a flowgraph 
to the polynomial view as described in Section 2.7. 

Example 3.3. Consider the size-4 butterfly 6/(1,0,3,4) from Figure 5. By inspection. 



3.5 Shared Twiddle Factors 

Our formulation permits two multiplications, by uj^^^ and ijJ^^^^, per node in the FFT 
flowgraph. Although this generality may be useful for some algorithms, we show here that 
it is not needed when minimizing the total FLOP count is the objective. In fact, it only 
increases the complexity of the SMT model. 

Theorem 3.1. For any size-2 butterfly, 6/i, such that left and right twiddle factors 
are unshared per node in row {ndo^cdtfp ^ ndQ^c-ftfp) but shared per node in row 1 



ndo^^.Ws = {12,12,12,12} 
ndi,,.Ws = {6,6,14,14} 
nd2,*.Ws = {3,11,7,15} 



(mod 

2 (mod "■/4) 

3 (mod 
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{ndi^c-ltfp = ndi^c-fifp) there exists another size-2 butterfly, 6/2, such that left and right 
twiddle factors are shared per node for all nodes, that realizes all final weighted sums X{0) 
and -'^(l) possible by bfi. Furthermore, no bfi exists with lower FLOP count than some 
6/2. 

Proof. The proof is in two parts. First, we prove the existence of three different 6/2. 
Consider the computation performed by 6/1, 

X(0) ^ i^f'^'-'^^ao^jf''"''^'' + aiV'n'"'""^^) (mod "A), (la) 
X(l) ^ i^f'^'-'^^ao^pf'-^-^'^" + aii;f'-'-''^^) (mod n/4). (lb) 

Because of Property 3.1 and Theorem 2.1, ndo,* left and right twiddle factors are related 
by a common weight stride, 

:ndo,i.Wi, lUdo.i.ltfp indo,i.Wb,ndo,i.rtfp 
,ndi,,.Ws ^ Vn ' Wn ^ Wn ' Wn ' , s 

~ jridofi-Wi, indofi.ltfp ~ ,n(io,0-Wi, .ndofi-rlfp ^ ' 

This simplifies to 

.ndo^iMfp indo^i.rtfp 

yn Wn fmod "A) f2l 

.ndofi.ltfp ~ . ndofi.rtf p V / \ ) 

Wn Wn 

which directly relates left and right twiddle factors for nodes in row of bfi. 

Equations 1 and 2 can be used to derive three different 6/2, labeled 6/2^1 fe/2B and 6/20- 
Butterfiy 6/2^, with ndofl.lftp = ndofi.rtfp = 0, is created by multiplying Equation la by 

ndQQ.ltfp ndQQ.rtfp 

1 = Ido^oMfP and Equation lb by 1 = '''"do^^.rtfp , 

XiO) . (^"^-*^^^n'^-^*^^)(ao%^ + ai^^) (mod n/4) 

Wn Wn 
.ndofi.rtfp .ndo^i.rtfp 

Xil) . (^"^-*^^^f-'^*^^)(ao%^ + ai^^) (mod V^)- 

Wn Wn 

After simplification of twiddle factors, the two twiddle factors applied to oq are now shared 
(V'^) and the two twiddle factors applied to oi are also shared due to Equation 2. 

Butterfiy 6/2,6, with ndofi.rtf p made equal to ndofi.ltfp, is created by multiplying Equa- 

tiOn lb by 1 = ndQ Q.Ufp ndg g.rtfp ) 

X(0) EE V^^'^^-°-*^^(aoVn''''''-'*^^ + aiVn''''^'*^^) (mod n/4) 

indofi.rtfp .ndo,o-ltfp indofi.rtfp ,ndofi.ltfp ^ndos-rtfp 

/ indii.tfpWn N/ Wn Wn , Wn Wn n / i / \ 

' \Mtfp )(^0 -r^3^, +«1 -^^d^p )(mod"/4). 

tyn lyn ifn 

Again from direct inspection and application of Equation 2, every node shares left and right 
twiddle factors. 
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Butterfly 6/3C, with ndQfl.lt fp made equal to ndofl.rtf p, is created by multiplying Equa- 

tion la by 1 = ndQ0.lt fp ndQQ.rtfp: 
^n ' i^n ' 

j ndo^Q.ltfp j ndQfi.ltfp I ndQfi.rtf p , ndQ^i.Ufp , ndo^Q.rtfp 
X(0) ^ {^n ' .ndQ,Q.rtfp )(^0 J^^d^^p + «1 ,ndQ,Q.Hfp ) (^O^ n/,) 

X(l) ^ Vn'^'"*^"(ao^n'°'"^*^" + aiVn'°'"^*^") (mod V4). 

Here, too, every node shares left and right twiddle factors. 

Second, exhaustive search with SMT is used to prove that no bfi exists with lower FLOP 
count than some 6/2. The SMT-based proof is a miter between 6/1 and and 6/20- 

The bfi side of the miter is a size-2 FFT modeled in SMT as described in Section 3.2. Ad- 
ditional constraints that ndi fl.lt fp = ndifl.rtf p and ndi^i.ltfp = ndi^i.rtfp are added for 
bfi. The 6/2 side of the miter includes models for all three cases A, B and C. These are 
also modeled in SMT as described in Section 3.2 but with the additional constraint that 
each node has just one twiddle factor, tfp. Furthermore, the constraints bf2A-ndofl = 0, 
bf2B-ndofl.tfp = bfi.ndofl.ltfp, and bf2c-ndQfl.tfp = ndofl.rtfp are included with the re- 
spective 6/2 models. Input values aj with arbitrary initial weights on base ndo^c-Wf, and 
row weight strides ndi^^.Ws are common to all bfi and 6/2. The free variables decided by 
the SMT solver include these common initial weights on base and row weight strides as well 
as weights on base per node for all row 1 nodes in bfi and 6/2. FLOP counts for bfi, 6/2A, 
bf2B and 6/2C1 are individually and explicitly tallied within the SMT model. The question 
posed to the SMT solver is to find a bfi with lower FLOP count than 6/2A) or 6/20- 
The theorem is proved for some n if the SMT solver returns unsatisfiable. The proof can 
be run once for every size-n FFT fiowgraph under consideration, or induction can be used 
to establish the result for n + 1 and higher. □ 

Example 3.4. Consider the concrete computation performed by some bfi from a size-16 
FFT fiowgraph expressed as, 

X(0) = ^Uaoi^w + aiV'?6) (mod -/4) 
X{1) = i^^iaotpfe + ai-iAie) (mod "/4). 

By substituting into Equation 2, we see that this is a valid butterfiy with weight stride 
adhering to Property 3.1, 



iplg (mod "/4). 



Some sharing can occur during the complex multiplication of oq and ai with these left 
and right twiddle factors since 9^e('0i6) = 3m{ipiQ) and lHe(V'i6) = ^^(V'le)- Hence, the 
multiplication FLOP count for 6/1 is only 16 = 8-|-8-|-0-|-0. 
Butterfiy bf2A has only a V'le twiddle factor applied to oq. 



X{0) = ^Ple{ao7p% + ai?/^?^) (mod 
X{1) = ^feiaoi^w + ai4^w) (mod n/^). 
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Note that the twiddle factors appHed to oq and ai, Tpi^, are shared for X{0) and X{1). The 
results for X{0) and X(l) are still equivalent to bfi as the final twiddle factors, ndi^c-tfp, 
are now adjusted by factoring out -^jg and ipf^ respectively. The total multiplication cost 
for 6/2A is 16 = + 4 + 6 + 6, which is the same as bfi. 

Butterfly 6/25 has only a V'le twiddle factor applied to oq, 

X{0) = V^eCaoV'le + aii^fe) (mod "/4) 
X{1) = V'^eCazV'le + oi^/'Je) (mod "A). 

The "i/^fg is factored out of the sum in X{1) to maintain equivalence with bfi. The total 
cost for 6/2,8 is also 16 = 6 + 6 + + 4. 

Butterfly 6/2C has only a tpi^ twiddle factor applied to oq, 

X{0) = i^leiaotpfe + aiV^ig) (mod n/^) 
X{1) = V'?6(a2V'?6 + aii^w) (mod "A). 

The ipiQ is factored out of the sum in X{0) to maintain equivalence with bfi. The total cost 
for 6/2C is also 16 = 6 + 6 + 4 + 0. Although all butterflies in this example have the same 
multiplication cost, this is not always the case in general. The SMT portion of the proof of 
Theorem 3.1 shows that at least one case of 6/2 will have FLOP count less than or equal 
to bfi. 

The definition of bfi in Theorem 3.1 requires that twiddle factors be shared per node in 
row 1, ndi^c-^tfp = ndi,c.rtfp. Butterflies meeting this constraint only occur at the bottom 
of the FFT flowgraph, where a single weight may be applied to some X{k). But after 
applying Theorem 3.1 to all terminal size-2 butterflies in the bottom row, we now have 
shared twiddle factors in the next to the bottom row of the FFT flowgraph. Therefore, 
Theorem 3.1 can be applied iteratively to the entire flowgraph, starting from the bottom 
and proceeding to the top, so that all nodes have a single twiddle factor, tfp, without any 
FLOP count penalty. 

Property 3.2. For any size-2 butterfly from a FFT flowgraph, if all nodes have a single 
shared twiddle factor t/p, then ndifl.Wi, = ndi^i.W^, (mod "/4). This is because ndi^ and 
ndi^i both have the same left parent with the same tfp (mod "/4) applied. 

Given Property 3.2, Algorithm 1 can now be updated so that the SMT formulation 
assigns a weight on base per size-2 butterfly and not per node. Instead of assigning a 
random Wb per node as seen in line 3 of the algorithm, a random Wh is assigned per bottom 
two nodes of every size-2 butterfly. This reduces the number of free Wfe variables by half 
and substantially speeds up the SMT-based search. Shared twiddle factors in the SMT 
model reduce a size-n flowgraph's solution cardinality to 2 2 "^{(iog2 ")-2)_ a size-256 
flowgraph, this further reduces the solution space to 2®^^^. Computation time for proving 
that a size-32 flowgraph has no solution with total cost less than or equal to 455 FLOPs is 
now 3.5 seconds. The timeout of 24 hours is still reached for a size-64 flowgraph constrained 
to 1159 FLOPs. 
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3.6 Partitioning 



Every SMT model formulated so far has been monolithic, and it has been computationally 
difficult to prove the lowest arithmetic complexity for any FFT larger than size-32. In this 
section, we show that analysis of butterflies at the top and bottom of the flowgraph can be 
used to partition larger FFTs into several smaller SMT models that can be solved. This 
analysis is facilitated by explicitly writing out the final weight on base computations, with 
all operations congruent (mod "/'i), for an arbitrary size-4 butterfly: 
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All weights on base internal to the butterfly have been eliminated by repeated substitution. 
All weight strides for nodes in the same row are congruent due to Property 3.1. It is 
instructive to trace all 16 paths from an input operand to an output value for a size- 
4 butterfly and verify that the weight on base computation for that path is included in 
Equation 3. 



3.6.1 Partitioning Using Original Butterflies 

At the top of a flowgraph, all Oj have a weight of 1, uj^. Butterflies that have input values 
which are some of these original aj are called original butterflies. Analysis of original 
butterflies can exploit this known weight on aj to partition the FFT flowgraph and hence 
the SMT model. 



Property 3.3. The weight stride for all nodes in any butterfly that includes only nodes 
belonging to / mod x* — 1, x* + 1, x* — i and x* + i from the polynomial factor tree is 
congruent to (mod "-/i). This follows from the weight stride relationship to the polynomial 
view established in Section 2.7. 
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Example 3.5. Consider the size-4 original butterfly 6/(4,2,0,4) from Figure 5. By in- 
spection, Ws for all nodes in this butterfly {0,4,8,12} is congruent to (mod 4). All size-4 
original butterflies, and some larger, exhibit Property 3.3. 

Theorem 3.2. For any arbitrary size-4 original butterfly, bfi, there exists another size-4 
butterfly, 6/2, which has zero-cost twiddle factors for nodes in rows and 1, such that all 
realizable flnal weighted sums X{k) of bfi can be realized by 6/2. Furthermore, no bfi 
exists with lower FLOP count than this 6/2. 

Proof. The proof is in two parts. First, to prove all realizable flnal weighted sums of 
bfi can be achieved by 6/2, we substitute for all initial weights (ndo,*.VFft = 0), for all 
twiddle factors in rows and 1 (n(io,*.t/p = ndi^^.tfp = 0) and for all weight strides 
{ndi^^.Wg = nd2,>fWs = 0), into the expressions from Equation 3: 
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By direct inspection we establish that any flnal weight on base can be realized by twiddle 
factors of nodes only in the last row. 

Second, exhaustive search with SMT is used to prove that no bfi exists with lower FLOP 
count than its 6/2. The SMT proof is a miter that includes bfi and 6/2. The 6/2 side of 
the miter is a direct translation to SMT of the flnal weight on base computations just seen. 
The bfi side of the miter is created by substituting for all initial weights (n(io,*.VFb = 0) 
and for all weight strides {ndi^^:.Ws = nd2^*.Ws = 0) in the expressions from Equation 
3. Final weights bfi.X(k).Wb are required to be equivalent to corresponding flnal weights 
bf2-X(k).Wb. FLOP counts for bfi and 6/2 are individually and explicitly tallied within 
the SMT model. The question posed to the SMT solver is to flnd a bfi with lower FLOP 
count than 6/2. The theorem is proved for some n if the SMT solver returns unsatisflable. 
The proof can be run once for every size-?i FFT under consideration, or induction can be 
used to establish the result for n + 1 and higher. □ 



26 



This theorem appears to conflict with decimation-in-time FFTs, such as shown in Figure 
5, where costly twiddle factors appear in the first two rows of the FFT. Consider the 
size-4 butterfly 6/(4,2,0,4) from Figure 5. There is multiplication cost at internal nodes 
nd{8,2,8) and nd{8,6,8). But the twiddle factor, nd(8, 2, 8).a;fg can be factored out and 
pushed down to the children nodes nd{4,2,4) and nd(4,2, 12). Likewise, an cj^g must also 
be factored out of nd{8, 6, 8) to maintain algebraic correctness. After factoring out the wfg, 
all multiplication cost occurs on the bottom row of 6/(4,2,0,4) and the total cost remains 
24 FLOPs. Globally, there is now no size-4 original butterfly with cost in the first two rows. 

Because of Theorem 3.2 and the recursive structure of the FFT, we can now partition 
the FFT flowgraph when solving for minimum total arithmetic complexity. In general, we 
must solve for all FFTs corresponding to / mod x* —i and x* +i branches in the factor tree. 
For a size-n FFT, this requires solving SMT models for pairs of size-p butterflies, for all p 
from 1 up to "/4. In practice, for values of p = 8 the problem becomes trivial and is used as 
the terminal case of partitioning. The most difficult partition of a size-n FFT flowgraph, 
a size-^ butterfly, will have a solution space of 2 8 ^°S2 4 ((iog2")-2)_ more concrete terms, 
the largest SMT models required to solve a size-256 flowgraph are for two size-64 butterflies 
corresponding to the / mod x^^ — i and x^^ -|- i branches of the factor tree. One of these 
size-64 butterflies has a solution space of 2^^^^. 

Computation time for proving that a partitioned size-64 FFT flowgraph has no solution 
with total cost equal to or less than 1159 FLOPs is now 2.8 seconds. Our timeout of 24 
hours is reached when attempting to prove that a size-128 FFT flowgraph has no solution 
with total cost equal to or less than 2824 FLOPs. 

3.6.2 Partitioning Using Terminal Butterflies 

At the bottom of a flowgraph, the weight on base required for each final result X{k) is 
known. This enables analysis of terminal butterflies, or butterflies producing some final 
values of X{k), so that the model may be further partitioned. 

Theorem 3.3. For any arbitrary size-4 terminal butterfly, 6/1, there exists another size-4 
butterfly, 6/2, which has zero-cost twiddle factors for nodes in rows 1 and 2, such that all 
realizable final weighted sums X{k) of 6/1 can be realized by 6/2. Furthermore, no 6/1 
exists with lower FLOP count than this 6/2. 

Proof. The proof is in two parts. First, to prove all realizable final weighted sums of 6/1 can 
be achieved by 6/2, we substitute for all final weights {X{k).Wh = 0) and for all twiddle 
factors in rows 1 and 2 {ndi^-^.tfp = nd2,*-tfp = 0) into the expressions from Equation 3: 
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Rows have been reordered to group common twiddle factors. By direct inspection we 
establish that a final weight on base of for all X(A;) can be realized by twiddle factors of 
nodes only in the first row. 

Second, exhaustive search with SMT is used to prove that no hf\ exists with lower 
FLOP count that its 6/2. The SMT proof is a miter that includes 6/1 and 6/2. The 
6/2 side of the miter is a direct translation to SMT of the final weight on base computa- 
tions just seen. The hf\ side of the miter is created by substituting for all final weights 
{X{k)Wh = 0) in the expressions from Equation 3. Input values ndo,*.Wfe and row weight 
strides, nd\^^Ws^nd2^^Ws^ are common to 6/1 and 6/2. FLOP counts for 6/1 and 6/2 are 
individually and explicitly tallied within the SMT model. The question posed to the SMT 
solver is to find a 6/1 with lower FLOP count than 6/2. The theorem is proved for some n if 
the SMT solver returns unsatisfiable. The proof can be run once for every size-n FFT under 
consideration, or induction can be used to establish the result for n + 1 and higher. □ 

This theorem appears to conflict with decimation-in- frequency FFT algorithms, such as 
shown in Figure 3, where costly twiddle factors appear in the last two rows of the FFT 
flowgraph. Consider the size-4 butterfly 6/(1,0, 1,4) from Figure 3. There is multiplication 
cost at internal nodes n(i(2, 1,2) and n(i(2, 1,6). But the twiddle factor, n(i(2, 1, 2).a;g can 
be distributed and pushed up to the parent nodes n(i(4, 1,4) and n(i(4,3,4). Likewise, 
an cjg must also be factored out of n(i(2,l,6) to maintain algebraic correctness. Now all 
multiplication cost occurs in the top row of 6/(1,0, 1,4) and the total cost remains the same. 
Globally, there is now no size-4 terminal butterfly with cost in the last two rows. Note that 
for this small size-8 FFT, this new configuration of twiddle factors now fails conditions for 
partitioning by original butterflies as costly twiddle factors now occur in the top two rows. 
For this reason, combined original and terminal partitioning is only applicable to size-16 
and larger FFT flowgraphs. 

By Theorem 3.3 and the recursive structure of the FFT, we can now further partition 
the FFT flowgraph when solving for minimum FLOP count. In general, we must solve for 
all FFTs corresponding to / mod x* — i and x* + i branches in the factor tree, but now 
each branch can be partitioned into four smaller equally sized FFTs. For a size-n FFT, this 
requires solving SMT models for groups of 8 size-p butterflies for all p from 1 up to j|. In 
practice, for values of p = 8 the problem becomes trivial and that is used as the terminal case 
of partitioning. The most difficult partition of a size-n FFT flowgraph, a size-^ butterfly, 
will have a solution space of 232 ^°S2 T6{(iog2'T')-2)_ concrete terms, the largest SMT models 
required to solve a size-256 flowgraph are eight size-16 butterflies corresponding to the / 
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mod X — i and x + i branches of the factor tree. One of these size-16 butterflies has a 
solution space of 2^^^. 

We can now prove the surprising result that size-256 FFTs exists which require only 
6616 FLOPs, rather than the 6664 FLOPs required by the traditional split-radix, even 
when twiddle factors are of modulus one. Finding a 6616 FLOP algorithm requires 22 
seconds to compute when the lowest cost constraint is used for each partition. Just over 5 
seconds is required for the toughest size-16 partition. Of course, searching for the lowest 
cost in a partition requires repeated SMT runs and consequently the total search time is 
higher. To prove that no solution exists with FLOP count lower than 6616 requires 160 
seconds total, with the toughest partition requiring just over 50 seconds. 

3.7 Symmetry Reductions 

We find that there are many FFTs with equivalent final FLOP count yet with different 
twiddle factor values. Prior work in twisting[l][42] indicates that this should be expected. 
In this section, we highlight two types of symmetry reduction that reduce SMT run times. 
Many local symmetry reduction constraints are possible and we experimented with dozens 
but found only these two to be of any significance. 

3.7.1 3-NoDE Symmetries 

A size-2 butterfly is 3-node symmetric if 3 of its 4 nodes require 6 FLOPs for multipli- 
cation. Symmetries are eliminated by forcing ndi^.t/p to have no multiplication cost. 

Example 3.6. Consider a concrete computation performed by a size-2 butterfly from the 
size-32 FFT flowgraph expressed as 

^(0) = V'32(ao'032 + aiV'i2) (mod "A) 
^(1) = i'hiaoi^h + aii^h) (mod "/4). 

This butterfly requires 18 = 6 -|- 6 -|- 6 FLOPs for multiplication. If the V'32 is factored out 
to "zero" the weight on ai and shared twiddle factors are preserved, these equations can be 
expressed as 

X{0) = ^^i2(aoV'^2 + ai^/'^a) (mod 
X{1) = V'i2(aoV'32 + ai4^^2) (mod n/^)^ 

with total multiplication cost of 18 = 6 -|- 6 -|- 6 FLOPs again. Alternatively, if the ^^32 is 
factored out to "zero" the weight on ao, these equations can be expressed as 

^(0) = V'32(aoV'32 + aii'h) (mod "/4) 
X{1) = V'32(aoV'32 + aii^h) (mod "/4), 

with total multiplication cost of 12 = 6 -|- 6 FLOPs. For the values in this example we find a 
cost benefit from factoring out the ^^32- We must be pessimistic and assume the worse case, 
18 = 6 -|- 6 -|- 6 FLOPs, since only one weight is guaranteed zero-cost. It is also possible to 
"zero" the weight on the X{1) sum by distributing the V'32 achieve the same 12 FLOP 
configuration. 
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In the SMT model, 3-node symmetric size-2 butterflies are detected and only those with 
zero multiplication cost for ndi^ are allowed. This is built by defining the following illegal 
condition, 

ndifi.cG A {{ndofl.cQ A ndo,i-c6) V {ndofi.cG A ndi_i.c6) V (n(io,i-c6 A ndi^i.cG)), 

for each size-2 butterfly and then requiring the inverse be satisfied in the SMT model. 

We have verified with SMT-based proofs like those seen previously that this constraint 
doesn't increase the butterfly's FLOP count. As in the example, it may lead to a lower FLOP 
count if some node other than ndifl has an applied weight of zero. We have formulated 
more complex constraints to detect these better cases early but found negligible speed-up 
in SMT runs. Instead, we rely on the cost-constraint described in Section 3.2 to eventually 
eliminate bad choices. Finally, if the SMT solver happens to choose the better placement 
of zero applied weight to begin with, the node is not 3-node symmetric (multiplication cost 
is less than 16 FLOPs) and no 3-node symmetric constraint will apply. 

3.7.2 Bottom Equal- Pair Symmetries 

A size-2 butterfly has equal-pair symmetries if nodes ndi^Q and ndi^i have multiplication 
cost and equal twiddle factors, ndifi.tfp = ndi^i.tfp. This symmetry is eliminated by 
requiring that these identical twiddle factors in row 1 be distributed to row nodes of the 
butterfly. 

Example 3.7. Consider a concrete computation performed by a size-2 butterfly from the 
size-32 FFT flowgraph expressed as 

X{0) = V'i2(aoV'32 + aii^32) (mod n/^) 
X{1) = %^hM32 + aiip^2) (mod "/4). 

This butterfly requires 12 = 6 -|- 6 FLOPs for multiplication. If the '032 is distributed, these 
equations can be expressed as 

^(0) = '032(aoV'i2 + aiiph) (mod "/4) 
^(1) = ^32(aoV'i2 + aiV'i2) (mod "/4), 

with total multiplication cost of 12 = 6 -|- 6 FLOPs again. Another example with initial 
multiplication cost in row is 

X(0) = V^i2(aoV'i2 + 01^32) (mod "A) 
^(1) = + ai^/'32) (mod "/4), 

with total multiplication cost of 18 = 6 -|- 6 -|- 6 FLOPs. After distributing the 'i/'32) this 
becomes 

X(0) = V32(«0V'|2 + aiV'i2) 

X(l) = V§2(«oV'l2 + aiV'i2), 

with lower multiplication cost of 12 = 6 -|- 6 FLOPs. For the values in this example we find 
a benefit but note that the final FLOP count is never worse than the initial as proved with 
SMT. 
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In the SMT model, bottom equal-pair symmetric butterflies are not allowed. This is 
built by defining the following illegal condition, 

{^ndifi.cO) A {ndifi.tfp = ndi^i.tfp), 

for each size-2 butterfly and then requiring the inverse be satisfied in the SMT model. 

We have verified with SMT-based proofs like those seen previously that this symmetry 
reduction doesn't increase the butterfly's FLOP count. A similar constraint for top equal- 
pair symmetric butterflies can be formulated, and even applied in combination with the 
bottom equal-pair symmetric constraint with care, but we found negligible speed-up in 
SMT runs when doing so. 

These two symmetry reduction constraints now bring the total time for finding a 6616 
FLOP count solution for a size-256 FFT down to 8 seconds. To prove that no solution 
exists with less than 6616 FLOPs now requires 50 seconds. It is now possible to find a 
15128 FLOP count solution for a size-512 FFT in about 11 hours. We gave up on attempts 
to find solutions better than 15128 FLOPs after spending more than 14 days. There were 
four partitions for which we could not prove unsatisfiable when applying a FLOP count 
constraint of the "best found less one." From experience, we suspect that a 15127 FLOP 
solution is most likely unsatisfiable given the dramatic increase in SMT solver run times. 

4. Results and Experiments 

Table 1 summarizes our results for SMT-based search of various size FFT flowgraphs. For 
size-256 FFTs and larger, we see that algorithms with FLOP count lower than the traditional 
split-radix do exist even when all twiddle factors have modulus one. We also show FLOP 
counts for the traditional spit-radix and for the tangent FFT [32] [1] , where twiddle factors are 
scaled and hence not modulus one. As expected, the required SMT time quickly becomes 
intractable as larger FFTs are considered. Yet it is still instructive to consider FFTs of 
relatively small size as such FFTs appear in larger FFTs. Finally, we do not know the 
number of FFT algorithms meeting these minimum FLOP count constraints but do know 
that there are many. We did search for multiple solutions of a size-256 FFT flowgraph 
partition and found hundreds before terminating. These solutions have both different values 
and placement patterns for costly twiddle factors. 
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Table 1 . Lowest FLOP Counts Found by SMT Search 
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The times reported in Table 1 are for the FLOP bounds at the boundary between 
satisfiable and unsatisfiable. We search for this boundary using binary search akin to 
Newton's method. We start with the best known FLOP bound for that size and class of 
FFT found in the literature, and divide that by 2. If that is satisfiable, we consider that 
the best known FLOP count and repeat. But if it is unsatisfiable, we choose a new FLOP 
bound half way between the unsatisfiable FLOP bound and the last known satisfiable bound 
and repeat. A complete search does require more time than seen in Table 1, but we find 
that FLOP counts far away from the boundary are solved relatively fast, whether they 
are satisfiable or unsatisfiable. Only when the boundary is approached do times increase 
dramatically. Furthermore, by imposing a timeout, we can skew the search to approach the 
boundary from the satisfiable side, where FLOP counts are successively becoming lower. 
This improves overall search performance as proving satisfiable cases is generally less costly 
than proving unsatisfiable cases. 

The reduction in FLOP count of FFTs found by SMT search appears to accelerate for 
larger n when compared to the tangent FFT. Our size-256 solution has an advantage of 
48 FLOPs when compared to the traditional split-radix FFT, whereas the tangent FFT 
has an advantage of 112 FLOPs. At this size, our FFT provides 48/ii2 = 0.429 of the 
advantage of the tangent FFT. At size-512, this advantage is 240/320 = 0.75. It is unclear 
if this approaches the tangent FFT advantage asymptotically, eventually surpasses it, or 
degrades. We suspect that the opportunities for optimization may be increasingly richer as 
partition sizes and the number of costly twiddle factors that they contain grow. 

4.1 SMT QF_BV Solver Experiments 

The results reported so far have all been generated using the SMT solver Boolector[7]. In 
this section, we present results for various SMT solvers, and identify some SMT solver 
characteristics best suited for our problem. 

We use four representative benchmarks for our experiments. The first, Sz256_6616, is 
the hardest partition from a size-256 flowgraph with a 6616 FLOP bound and is known to be 
satisfiable. The second, Sz256_6615, is also the hardest partition from a size-256 flowgraph 
but with a 6615 FLOP bound and is known to be unsatisfiable. Likewise, the third and 
fourth benchmarks, Sz512_15128 and Sz512_15127, are the hardest partitions from a size- 
512 flowgraph with 6616 and 6615 FLOP bounds respectively. Only Sz512_15128 is known 
to be satisfiable. Whether benchmark Sz512_15127 is satisfiable is unknown, but we suspect 
it is unsatisfiable. 

For state-of-the-art SMT solvers, we use the top four SMT solvers in the QF_BV 
category, closed quantifier- free formulas over the theory of fixed-size bit vectors, from the 
SMT-2011 competition [9]: Z3[16], STP2[24], Boolector[7] as well as MathSat5[8] main and 
application configurations. We include two additional QF_BV solvers that performed well 
in earlier competitions: Beaver[31] and Yices[20]. For the SMT-2011 competition solvers, 
we used the binary executables and unmodified run scripts from the SMT-2011 competition 
web site[9]. For Beaver and Yices, we downloaded the latest available version from the web: 
Beaver 1.2.0.780 and Yices 2.0, build date of July 29, 2010, for x86_64-unknown-linux-gnu. 
Both Beaver and Yices were executed without any additional command line options. We 
updated our pretty printer to support SMT-LIB 2.0[49] for the four SMT-2011 competition 
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solvers. Beaver and Yices were given SMT-LIB 1.2 input. All experiments were run on a 
64-bit Intel Core 17 Linux machine. 

Results for our SMT solver experiments are shown in Table 2. We have ordered the 
results from best to worst performance on benchmark Sz256_6615, which we consider the 
most representative as all lowest FLOP searches must end with a proven unsatisfiable case. 
For this unsatisfiable case, all solvers perform in the same order of magnitude, with the worst 
performer requiring 2.5 x the amount of time as the best performer. For the satisfiable cases, 
we see a larger variation in performance due to the rich set of solutions that exist and the 
chances that a particular solver's search strategy will find one first. All SMT solvers reached 
the timeout of 24 hours without solving Sz512_15127. 
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Table 2. SMT Solver Performance 



For Beaver, the best performing SMT solver on Sz256_6615, we also varied the command 
line options to test their effect. The most noticeable differences, although minor, came from 
disabling optimizations. Table 3 summarizes our findings. Constant propagation appears to 
be the most effective for our application. It should prove beneficial to incorporate constant 
propagation at the high-level when we generate our initial SMT models. 
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Table 3. Beaver Optimization Options 



4.2 Bit-Blasting and SAT Solver Experiments 

A common trait of the three best SMT solvers for our problem. Beaver, STP2 and Boolector, 
is that they focus on bitvector problems. All three perform bit-blasting and then use a SAT 
solver back-end to solve a traditional SAT problem. Furthermore, they pay close attention 
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to the circuit structure when optimizing and bit-blasting. All three incorporate AIGs, And- 
Invert Graphs[36], and rewriting of AIGs. Beaver and STP2 use the ABC[43] library to 
facilitate this. Furthermore, Beaver employs the SAT solver nflsat[30], which operates on 
AIGs natively, as its default back-end. Since this circuit-centric approach works well for 
our problems, this section presents experimental data on bit- blasting flows and SAT solver 
performance when considered separately. 

For each of the three best SMT solvers for our problem we implemented four experi- 
mental bit-blasting flows. At a high-level, these four flows are: 

1. SMT solver circuit representation to CNF with ABC 

2. SMT solver circuit representation to CNF with ABC after ABC optimization for SAT 

3. SMT solver circuit representation to CNF with AIGER 

4. SMT solver circuit representation to CNF with AIGER after ABC optimization for 
SAT 

Since each SMT solver's native circuit representation is slightly different, we first stan- 
dardized all circuit representations to AIGs. Beaver incorporates ABC and can generate 
AIGs natively. We generated an AIG using the command line options beaver -no-solve 
-aig -aig-f ile=<f ile . aig> <file.smt>. STP2 also incorporates ABC but has no work- 
ing command line option to generate an AIG file. Since STP2 is distributed as source, we 
were able to add an option to generate an AIG output file of it's internal circuit representa- 
tion. Boolector has an option to dump expressions in BTOR format. We generated BTOR 
with that option and converted it to AIG using synthebtor -m <f ile .btor> <f ile .aig>, 
which is a tool provided with the Boolector package. Thus, the bit-blasted representation 
from all three solvers are standardized as AIGs. 

In flows 1 and 2, CNF is generated from AIG by ABC using write_cnf . In flows 3 and 
4, CNF is generated from AIG by using the tool aigtocnf , which is part of the AIGER 
package[3]. In flows 2 and 4, the ABC optimization for SAT command drwsat is executed 
3 times to generate a simplified AIG. 

We selected 7 SAT solvers that are readily available and either have performed well in 
recent SAT competitions[40] [12] or are used in the back-end for Beaver, STP2 or Boolector 
already. 

• glueminisat 2.2.5[45] 

• simplifying minisat 2.2.0[22] 

• cryptominisat 2.9.0[53] 

• precosat 5 70 [2] 

• lingeling 2 76 [2] 

• clasp 2.0.2[35] 

• nflsat 05 102009 [30] 
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Figure 9. SAT Solver Performance for Various Bit-Blasting Flows 
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Figure 9 shows the results for each of the seven SAT solvers with all 12 bit-blasting 
flows. The bit-blasting flows are keyed to the SMT front-end, (B=Beaver, L=Boolector, 
S=STP2), along with the route to CNF, 1-4. Results are grouped by SAT solver, and 
ordered from worst (top) to best (bottom) average performance. Within a SAT solver 
grouping, results are ordered again from worst to best performance. The reported times 
include format conversion and ABC optimization, if applicable. 

From these results, we see that the SAT solver glueminisat, the winner in the SAT 2011 
competition for the UNSAT application track, consistently performs the best. Overall, the 
back-end SAT solver choice is more significant than the SMT bit- blasting tool and/or path 
to CNF. Still, Beaver bit-blasting appears to provide a slight second-order advantage. It is 
also interesting to note that the SAT solver clasp, the winner in the SAT 2011 competition 
for the UNSAT crafted track, performed the worst. Furthermore, Beaver's default choice 
of a back-end SAT solver, nflsat with native AIG input, does not distinguish itself. 

For our problem, the data suggests that bit-blasting by Beaver with conversion to CNF 
by AIGER (flow B3) for input to the SAT solver glueminisat is the best choice. We applied 
this flow to the unknown problem Sz512_15127 but were still unable to produce a result 
after days of compute time. From this we conclude that the greatest advances in solving 
our particular problem will come from high-level insight, such as additional problem par- 
titioning and identification of new problem-specific constraints. Next, the choice of the 
underlying SAT technology will have some beneficial effect as SAT technology continues 
to improve. And finally, how we cast our problem as SAT, including initial specification, 
constant propagation, choice of bit-blasting and conversion to CNF, can be adjusted for 
further second-order improvements. 

4.3 SMT QF_LIA Solver Experiments 

If is unclear whether modeling our problem as QF_BV is the best choice. It is possible to 
model bitvector problems with other logics[26][57][6][4]. In this section, we present a first 
attempt to model our problem as QF LIA, following the techniques of Kim and Somenzi[26] . 

In their recent paper [26], Kim and Somenzi showed that some QF_BV problems could 
be cast as QF_LIA for improved SMT solver performance. The main idea of their casting 
is to detect overflow and underflow of integer operations and use ITE operators to enforce 
the modular arithmetic of QF BV within QF LIA. We have implemented this casting by 
adding underflow/overflow detection and correction for all (mod n) operations in Algorithm 
1. Table 4 summarizes our results for all QF_LIA solvers we could find that accept SMT- 
LIB 2.0. The benchmarks Sz64_1160, Sz64_1159, Szl28_2824, Szl28_2823, are of similar 
character to the set used in QF_BV experiments shown in Table 2 except that they are from 
considerably smaller (size-64 and size-128) FFTs. When attempting the same benchmarks as 
used in the QF_BV experiments, the timeout of 24 hours was reached in all cases. Clearly, 
we must improve out initial QF_LIA specification and/or the underlying QF_LIA solver 
technology to make QF_LIA solvers competitive with QF_BV solvers on our problem. 

4.4 Algorithm Design 

The FFTs found by SMT-based search and posted on our web site [2 7] are witnesses that 
FFTs with lower total FLOP count than the split-radix exist even when all twiddle fac- 



36 





Sz64_1160 

SAT time(.s) 


Sz64_1159 

UNSAT time(.s) 


Szl28_2824 

SAT time(s) 


Szl28_2823 

UNSAT time(s) 


Solver 


Z3 


3.8 X 10-2 


5.2 X 10-2 


1.0 X 10° 


2.0 X 10^ 


MathSATS app 


7.6 X 10-2 


1.9 X 10-1 


1.5 X 10^ 


Timeout 


MathSATS main 


5.5 X 10-2 


1.0 X 10-1 


1.0 X 102 


1.4 X 10^ 



Table 4. SMT QF_LIA Solver Performance 



tors have modulus one, but are not practical algorithms in their current state. FFTs in 
widespread use usually can be defined succinctly in mathematical terms which leads to very 
regular patterns of twiddle factors in the FFT flowgraph. It is possible to formulate SMT 
constraints that require various forms of regularity in any satisfying solution. For example, 
additional constraints can be formulated and added to the model which allow costly twiddle 
factors only at specified nodes in the graph. A tighter constraint might force specific nodes 
to have prespecified twiddle factor values. A more relaxed constraint might just impose a 
relationship, such as a stride, between pairs of twiddle factors. In this way, the techniques 
described in this paper can be extended to do practical FFT algorithm design at the expense 
of proven optimality. Although this is a topic for further research, we highlight a few early 
experiments here. 

The split-radix created by delayed twisting as described by Bernstein[l] and Mateer[42] is 
very succinct yet can be used to generate a rich family of highly regular split-radix algorithms 
simply by choosing different legal twisting coefficients, C- By examining the twiddle factor 
patterns generated by this algorithm, we determine that twiddle factors applied to ordered 
coefficients of a polynomial in the factor tree must have a constant stride (twisted), match 
constant values as seen in the classic decomposition (delayed twisting) , or combine these two 
cases (twisting to something other than x* — 1). With constraints formulated and applied 
to the SMT model that require this pattern of twiddle factors, we no longer find solutions 
with total FLOP count less than the split-radix for size-256 FFT fiowgraphs. We do find 
solutions with FLOP count equal to the split-radix as expected. This confirms the theorem 
by Mateer[42] that combinations of twisting, though very rich, will never lead to an FFT 
with FLOP count lower than the split-radix. Although the regularity imposed by twisting 
doesn't support our solutions, other types of regularity might. 

The tangent FFT[32][1] starts with a version of the conjugate split-radix FFT[33]. In 
this algorithm, twiddle factors occur as conjugate pairs, where the conjugate pair is either 
at the top or bottom of a size-2 butterfiy. The complex twiddle factors for a conjugate pair 
can be factored as 

cosa(l -|- i tan a), cos 7(1 -|- itan7). 

Since a and 7 are conjugate angles, we know that cos a = cos 7. Van Buskirk's trick[39], 
which is exploited in the tangent FFT, moves these real scaling factors so that their cost 
is absorbed by other multiplications. With constraints formulated and applied to the SMT 
model that require twiddle factors to occur globally as conjugate pairs, we no longer find 
solutions with total FLOP count less than the split-radix for size-256 FFT fiowgraphs. We 
do still find instances with FLOP count equal to the split-radix. It still may be possible to 
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find solutions where conjugate pairs occur locally in specific places such that optimizations 
similar to Van Buskirk's can be beneficially applied. 

An objective to minimize FLOP count is primarily academic given the capabilities of 
modern computing hardware. Other more practical objectives include enhancing precision 
or easing implementation. For example, avoiding twiddle factors where the real or imag- 
inary part is a number very close to zero may enhance the precision of the final result. 
Alternatively, restricting all twiddle factors to some limited set may ease implementation, 
and we can formulate a SMT model that does just that. There are size-32 FFTs that use 
just two non-trivial costly twiddle factors, plus the free multiplications by 1, —1, i or —i. 
The minimum FLOP count for these algorithms is high at 616 compared to 456 for the 
split-radix but there may be benefits of having to multiply by just a few constants, espe- 
cially in hardware implementations. If we increase the set of allowed non-trivial twiddle 
factors for a size-32 FFT to three, the minimum FLOP count is 536. For a size-64 FFT, we 
find a 2112 FLOP count solution that uses only non-trivial twiddle factor powers from the 
set {7,8,9}. Note that these twiddle factor powers include conjugates so that only three 
transcendental function computations or table look-ups are required. We have posted some 
examples of these FFTs on our web site[27]. 

5. Conclusions and Future Work 

This paper presented a Boolean Satisfiability-based proof of the lowest FLOP count required 
by FFT algorithms up to size-512 with flowgraphs isomorphic to those generated by common 
power-of-two FFTs, and where all twiddle factors are n*'^ roots of unity. Even with these 
constraints, we find FFTs requiring fewer FLOPs than the split-radix starting at size-256. 
At the core of this proof is a novel way to enumerate all FFTs realizable by a given flowgraph. 
Partitioning and symmetry reduction techniques are developed to make it possible to prove 
FLOP count bounds for larger size-512 FFTs. Finally, because the SAT-based formulation 
and search techniques are general, the paper introduced additional search objectives that 
mimic twiddle factor patterns from twisting, require conjugate twiddle factor pairs, and 
minimize the allowed values of twiddle factors. 

As seen from our experimental results, our biggest advances came from applying a high- 
level understanding of this problem to partition and detect symmetries in order to simplify 
the input for SMT and SAT solvers. We believe that more effort along these lines is a 
good direction for future work. In particular, work in symmetry detection and breaking to 
simplify SAT[51] [34] is of interest. Just as this work uses computer automation to search for 
graph isomorphisms in the CNF structure, we can do the same at the more abstract FFT 
flowgraph level. Although symmetry breaking at the CNF level can benefit our problem, 
we believe that more progress can be made by exploiting higher-level symmetries in our 
specific problem. The challenge for us is to find useful isomorphisms with regard to twiddle 
factors, as the FFT flowgraph is very regular and rich in self-similarity. All our effort to 
partition and detect symmetry has been through human observation, and assistance from 
computer search may lead to better techniques. 

We have cast finding FFT algorithms as a bitvector problem and have used SAT and 
SMT solvers in a stand-alone manner to find solutions. This raises two questions for future 
work. First, is QF_BV the best logic for this problem? Although we present preliminary 
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results when cast as QF_LIA in Section 4.3, there are still other casting techniques and 
logics to try[57][6][4][41]. Of particular interest to us is casting our problem to integer 
linear programming with the techniques presented by Brinkmann[6]. This may allow us 
to optimize larger problems, especially when optimality need not be proven. Second, can 
larger and more interesting instances of our problem be solved through tighter integration 
with SAT and/or SMT solvers? There are ideas for integrating optimization with SAT and 
SMT solvers[46][37][13]. Solvers such as STP2[24] are providing APIs for tighter integration 
of user's applications. These directions remain unexplored by us but may yield significant 
improvements. 

Besides the future work just described, we plan additional work in three more directions. 
First, Section 4.4 highlights FFT algorithm design possible with techniques described in 
this paper. We will study the applicability of our techniques to practical FFT algorithm 
design, with cost objectives ranging from improved precision to implementation on specific 
hardware[48]. Second, we seek to impose regularity on our lowest FLOP count solutions to 
determine if they can be described more traditionally as succinct algorithms. This should 
also help us better characterize the FLOP savings as the the size of the FFT increases. 
Finally, we hope to ease the current constraint that all twiddle factors are n*^ roots of 
unity, and thus incorporate optimizations similar to those in Van Buskirk's[39] algorithm 
and the tangent FFT[3'2][1] directly into our search. 
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